Given first two weeks’ transactions, can we able to perdict a custom will have First Payment Default issue or not?
First two weeks’ transactions:
First Payment Default:
trans1 <- read.csv('Data/April Vintage Transactions page 1.csv', stringsAsFactors=FALSE)
trans2 <- read.csv('Data/April Vintage Transactions page 2.csv', stringsAsFactors=FALSE)
trans2$DebtDimId <- as.character(trans2$DebtDimId)
trans3 <- read.csv('Data/April Vintage Transactions page 3.csv', stringsAsFactors=FALSE)
trans3$DebtDimId <- as.character(trans3$DebtDimId)
trans4 <- read.csv('Data/April Vintage Transactions page 4.csv', stringsAsFactors=FALSE)
trans4$DebtDimId <- as.character(trans4$DebtDimId)
trans.dat <- bind_rows( bind_rows(trans1,trans2),bind_rows(trans3,trans4) )
nrow(trans.dat)
## [1] 197271
#head(trans.dat)
#summary(trans.dat)
profile1 <- read.csv('Data/April Vintage 2013.csv', stringsAsFactors = FALSE)
profile2 <- read.csv('Data/April Vintage 2013 _ page 2.csv', stringsAsFactors = FALSE)
prof.dat <- bind_rows(profile1, profile2)
nrow(prof.dat)
## [1] 173253
#head(prof.dat)
#summary(prof.dat)
rm(trans1, trans2, trans3, trans4, profile1, profile2)
The statement date of this card is fixed, which means some member will have a very short ‘first month’. Therefore, I used another way to define the First Payment Default:
Month on book less than 2
DisplayMinPay is greater than ActualMinPay
length(unique(prof.dat$DebtDimId))
## [1] 7039
length(prof.dat$DebtDimId[which(prof.dat$MOB==0)])
## [1] 6660
length(prof.dat$DebtDimId[which(prof.dat$MOB==1)])
## [1] 7031
MOB.0 <- prof.dat %>%
filter(MOB == 0) %>%
filter(DisplayMinPay > ActualMinPay)
MOB.1 <- prof.dat %>%
filter(MOB == 1) %>%
filter(DisplayMinPay > ActualMinPay)
nrow(MOB.0)
## [1] 167
nrow(MOB.1)
## [1] 2042
ProblemId <- unique(c(MOB.0$DebtDimId, MOB.1$DebtDimId))
length(ProblemId)
## [1] 2069
We consider time into our model as an very important factor. There are many way to using the transaction time data.
In this report, I’m using the day difference between member getting card and each transaction day as the time variable.
Most credit cards are required active when user recieve the card, however, this card is actived when being issued. Therefore, activation date does not means the date member recieve the card.
In our case, we are interested in how much risk a member will owe our money. If a member never using our card, it has a lower risk they owe our money - They may still owe the membership fee, but at least, they are not stealing money from us.
Therefore, I decide to use First transaction date to indicate the day a member get their card.
trans.dat$TimeDimDt <- as.Date(trans.dat$TimeDimDt, "%m/%d/%Y")
trans.first.date <- trans.dat %>%
group_by(DebtDimId) %>%
summarise(firstDate = min(TimeDimDt))
nrow(trans.first.date)
## [1] 9369
Compare with previous section, we can see there are more member in transaction data table than in user profile table.
Transaction day = Transaction date - Date of member getting card
trans <- trans.dat %>%
full_join(trans.first.date, by='DebtDimId') %>%
mutate(DayAfterGetCard = as.numeric(TimeDimDt - firstDate))
These account have been here for a while, we cannot find their first transaction
legacyId <- prof.dat %>%
group_by(DebtDimId) %>%
summarise(minMOB = min(MOB)) %>%
filter(minMOB > 1) %>%
select(DebtDimId)
trans <- trans %>%
filter(! DebtDimId %in% legacyId )
length(legacyId)
## [1] 1
f.2.trans <- trans[trans$DayAfterGetCard <= 14,]
f.2.TransCount <- f.2.trans %>%
group_by(DebtDimId) %>%
summarise(rowNum = n())
f.2.max <- max(f.2.TransCount$rowNum)
f.2.max
## [1] 70
I’m using transaction day (interger), transaction amount (numeric) and MCCcode (factor) as the training variables.
For the target variable, I made it up based on member FPD or not. For member who has FPD issue, I set the target variable as 999; for member who does not have FPD issue, I set the target variable as 0.
All data explorer below are based on the first 14 days transactions after getting the card.
MCC.GourpByCode <- f.2.trans %>%
group_by(MCCcode) %>%
summarise(MCCcategoryCount = n_distinct(MCCCategory), MCCDescriptionCount = n_distinct(MCCDescription))
max(MCC.GourpByCode$MCCcategoryCount)
## [1] 1
max(MCC.GourpByCode$MCCDescriptionCount)
## [1] 1
MCC.GroupByDescription <- f.2.trans %>%
group_by(MCCDescription) %>%
summarise(MCCcategoryCount = n_distinct(MCCCategory), MCCCodeCount = n_distinct(MCCcode))
max(MCC.GroupByDescription$MCCcategoryCount)
## [1] 2
max(MCC.GroupByDescription$MCCCodeCount)
## [1] 4
MCC.GourpByCategory <- f.2.trans %>%
group_by(MCCCategory) %>%
summarise(MCCDescriptionCount = n_distinct(MCCDescription), MCCCodeCount = n_distinct(MCCcode))
max(MCC.GourpByCategory$MCCDescriptionCount)
## [1] 55
max(MCC.GourpByCategory$MCCCodeCount)
## [1] 58
MCC.Code.to.Desc.Cate <- f.2.trans %>%
group_by(MCCcode) %>%
slice(1) %>%
ungroup()
mcc.feq <- f.2.trans %>%
group_by(MCCcode) %>%
summarise( MCCcodeFeq = n() ) %>%
select(MCCcode, MCCcodeFeq) %>%
left_join(MCC.Code.to.Desc.Cate, by='MCCcode') %>%
select(MCCcode, MCCcodeFeq, MCCDescription)
plot_ly( mcc.feq, labels = ~MCCDescription, values = ~MCCcodeFeq, textinfo = 'label+percent', type = 'pie') %>% layout(showlegend = FALSE)
### MCCcode transaction ammount ###
mcc.amount <- f.2.trans %>%
group_by(MCCcode) %>%
summarise( MCCcodeAmount = sum(TransactionAmt) )%>%
select(MCCcode, MCCcodeAmount) %>%
left_join(MCC.Code.to.Desc.Cate, by='MCCcode') %>%
select(MCCcode, MCCcodeAmount, MCCDescription)
plot_ly( mcc.amount, labels = ~MCCDescription, values = ~MCCcodeAmount, textinfo = 'label+percent', type = 'pie') %>% layout(showlegend = FALSE)
mcc.amount.pertrans <- mcc.feq %>%
select(MCCcode, MCCcodeFeq) %>%
left_join(mcc.amount, by='MCCcode') %>%
mutate(amountPerTrans = MCCcodeAmount / MCCcodeFeq)
plot_ly( mcc.amount.pertrans, labels = ~MCCDescription, values = ~amountPerTrans, textinfo = 'label+percent', type = 'pie') %>% layout(showlegend = FALSE)
day.feq <- f.2.trans %>%
group_by(DayAfterGetCard) %>%
summarise( DayFeq = n() ) %>%
arrange(DayAfterGetCard)
plot_ly( day.feq, x = ~DayAfterGetCard, y = ~DayFeq, type='bar')
day.amount <- f.2.trans %>%
group_by(DayAfterGetCard) %>%
summarise( DayAmount = sum(TransactionAmt) ) %>%
arrange(DayAfterGetCard)
plot_ly( day.amount, x = ~DayAfterGetCard, y = ~DayAmount, type='bar')
## Warning: Ignoring 1 observations
day.amount.pertrans <- day.feq %>%
left_join(day.amount, by='DayAfterGetCard') %>%
mutate(AmountPerTrans = DayAmount / DayFeq) %>%
arrange(DayAfterGetCard)
plot_ly( day.amount.pertrans, x = ~DayAfterGetCard, y = ~AmountPerTrans, type='bar')
## Warning: Ignoring 1 observations
### Day vs. amount - box plot_ly ###
plot_ly( f.2.trans, x = ~DayAfterGetCard, y = ~TransactionAmt, type = 'box')
## Warning: Ignoring 2 observations
length(unique(prof.dat$DebtDimId)) / length(ProblemId)
## [1] 3.402127
AllId <- unique(prof.dat$DebtDimId)
SafeId <- AllId[! AllId %in% ProblemId ]
MCC.Feq.Good <- f.2.trans %>%
filter( DebtDimId %in% SafeId ) %>%
group_by(MCCcode) %>%
summarise( MCCcodeFeqGood = n() )
MCC.Feq.FPD <- f.2.trans %>%
filter( DebtDimId %in% ProblemId ) %>%
group_by(MCCcode) %>%
summarise( MCCcodeFeqFPD = n() )
MCC.Feq.All <- full_join(MCC.Feq.Good, MCC.Feq.FPD, by='MCCcode')
MCC.Feq.All <- MCC.Feq.All %>%
select(MCCcode, MCCcodeFeqFPD, MCCcodeFeqGood ) %>%
left_join(MCC.Code.to.Desc.Cate, by='MCCcode')
plot_ly( MCC.Feq.All, x = ~MCCDescription, y =~MCCcodeFeqFPD, type='bar', name='FPD Accounts') %>%
add_trace(y = ~MCCcodeFeqGood, name = 'Good Accounts') %>%
layout(yaxis = list(title = 'Feq'), barmode = 'stack')
## Warning: Ignoring 63 observations
## Warning: Ignoring 25 observations
MCC.Feq.All <- MCC.Feq.All %>%
mutate(FeqAll = MCCcodeFeqGood+MCCcodeFeqFPD) %>%
mutate(RiskRatio = MCCcodeFeqFPD/FeqAll )
plot_ly( MCC.Feq.All, labels = ~MCCDescription, values = ~RiskRatio, textinfo = 'label+percent', type = 'pie') %>% layout(showlegend = FALSE)
MCC.Amount.Good <- f.2.trans %>%
filter( DebtDimId %in% SafeId ) %>%
group_by(MCCcode) %>%
summarise( MCCcodeAmountGood = sum(TransactionAmt) )
MCC.Amount.FPD <- f.2.trans %>%
filter( DebtDimId %in% ProblemId ) %>%
group_by(MCCcode) %>%
summarise( MCCcodeAmountFPD = sum(TransactionAmt) )
MCC.Amount.All <- full_join(MCC.Amount.Good, MCC.Amount.FPD, by='MCCcode')
MCC.Amount.All <- MCC.Amount.All %>%
select(MCCcode, MCCcodeAmountFPD, MCCcodeAmountGood ) %>%
left_join(MCC.Code.to.Desc.Cate, by='MCCcode')
plot_ly( MCC.Amount.All, x = ~MCCDescription, y =~MCCcodeAmountFPD, type='bar', name='FPD Accounts') %>%
add_trace(y = ~MCCcodeAmountGood, name = 'Good Accounts') %>%
layout(yaxis = list(title = 'Amount'), barmode = 'stack')
## Warning: Ignoring 64 observations
## Warning: Ignoring 25 observations
#### Pie Plot via ratio (Percentage of percentage) ####
MCC.Amount.All <- MCC.Amount.All %>%
mutate(AmountAll = MCCcodeAmountGood+MCCcodeAmountFPD) %>%
mutate(RiskRatio = MCCcodeAmountFPD/ AmountAll)
plot_ly( MCC.Amount.All, labels = ~MCCDescription, values = ~RiskRatio, textinfo = 'label+percent', type = 'pie') %>% layout(showlegend = FALSE)
Trade each transaction as a increment of chance to FPD.
Believe each transaction are individual.
Then we can fomulize the model as:
\[ChanceToFPD = \sum_i A_iC_i e^{(-C_t+t_i)}\]
In which, \(A_i\) denote the amount of transaction \(i\), \(C_i\) denote the affect ratio of MCC code of transaction \(i\), \(t_i\) denote the time of transaction \(i\) happens, \(C_t\) is a constant to centralize the data.
Pros: Easy to explain, easy to build.
Cons: 1. This model believe each transaction are individual, which may not correct 2. This model cannot take advantage in big data
CNN or RNN both able to handle this problem.
Common CNN are easier to image and impletment, but hard to explain and optimize.
RNN will more reasonable, but I need to learn how to build up such a RNN system.